Let’s get started with ggplot2
An example: GDP and Life Expectancy
library(ggplot2)
library(gapminder) # 'gapminder' package contains the data
gapminder # Let's take a look at the data
Another look at the data frame
str(gapminder) # str() is a good way to look at the data frame
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Simple Scatterplot
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) # nothing to plot yet!

Simple Scatterplot
We can make the graph into an object to alter and add stuff later:
p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
Simple Scatterplot
p + geom_point() # Now we tell ggplot that we want a scatter plot

Simple Scatterplot
ggplot( data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()

# Of course, we can write that in one swoop
Let’s use a log scale for the x axis and keep that setting
p <- p + scale_x_log10()
Map continent variable to aesthetic color
p + geom_point(aes(color = continent))

To recap: full plot command thus far
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) + scale_x_log10()
Note, we put the aes() in the geom_point() element. We will see in a bit why.
Reduce overplotting
p + geom_point(aes(color = continent), alpha = 0.3, size=3)

# Setting transparency of points
Adding fitted curve
p + geom_point(aes(color = continent)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# by default, adds a smooth fit: loess for fewer than 1,000 observations, otherwise a GAM (as here)
Adding fitted curve
p + geom_point(aes(color = continent)) +
geom_smooth(color="black", lwd=2, se=FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# removing the confidence intervals
We could exchange the order of the layers
p + geom_smooth(color="black", lwd=2, se=FALSE) +
geom_point(aes(color = continent))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Use a linear fit instead
Options for method include lm, glm, gam, loess, and rlm (from MASS)
p + geom_point(aes(color = continent)) + geom_smooth(method="lm")
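A minimal sketch of choosing the smoother explicitly, assuming the gapminder data from above. The quadratic formula and the span value are illustrative choices, not taken from the slides.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10()

# Linear model with a quadratic term (illustrative formula)
quad_fit <- base_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))

# Local regression; a smaller span follows the data more closely (illustrative value)
loess_fit <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3)
```

Print quad_fit or loess_fit to draw the plot. Passing formula explicitly also silences the message about the method and formula being used.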

Smooth fit by continent
p + geom_point(aes(color = continent)) +
geom_smooth(lwd = 2, se = FALSE, aes(color = continent))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Now all layers use the continent grouping
# We could add the aes() grouping to the overall graph p
p <- p + aes(color = continent)
p + geom_point() +
geom_smooth(lwd = 2, se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Why another color=continent?
# Our original plot command:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10()
# A single smoothed line through all points:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10() + geom_smooth()
# Using the color aesthetic for the smoothing as well as the scatter points
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() + scale_x_log10() + geom_smooth(lwd=2, se=FALSE)
# Still single black smoothed line but now points are colored by continent:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(aes(color = continent)) + scale_x_log10() +
geom_smooth(color="black")
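One way to check how many lines the smoother fits is to count the groups in the smoothing layer. This is a sketch only: the helper n_groups() is hypothetical, and method = "lm" is used just to keep the example fast.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()

# The smoother inherits color = continent: one fitted line per continent
per_continent <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Setting color to a constant overrides the mapping in this layer: one line overall
overall <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

# Hypothetical helper: number of groups the smoothing layer (layer 2) was fitted on
n_groups <- function(plot) length(unique(layer_data(plot, 2)$group))

n_groups(per_continent)
n_groups(overall)
```

The first call should report five groups (one per continent), the second a single group.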
Grammar of Graphics
The Grammar of Graphics
- ggplot2 is based on a “grammar” of graphics, an idea that originated with Wilkinson (2005)
Main principles
- there are a few main principles:
- Graphics = distinct layers of grammatical elements (or grammar rules) that map pieces of data to geometric objects (like lines and points) and attributes (like color and size)
- if necessary some additional rules about scales, projections in a coordinate system, and data transformations are possible
- Plots arise through aesthetic mapping
- The grammar produces “sentences” (mappings of data to objects) but they can easily be garbled if you define poor mappings.
Three key grammatical elements
| Data | The dataset being plotted. |
| Aesthetics | The scales onto which we map our data. |
| Geometries | The visual elements used for our data. |
- every ggplot2 plot has these three key components
All seven grammatical elements
| Data | The dataset being plotted. |
| Aesthetics | The scales onto which we map our data. |
| Geometries | The visual elements used for our data. |
| Facets | Plotting subsets of the data. |
| Statistics | Statistical representations of our data to aid understanding. |
| Coordinates | The space on which the data will be plotted. |
| Themes | All non-data ink. |
A diagram of the graphical elements
ggplot2 layers: data

ggplot2 layers: data
gapminder
ggplot2 layers: aesthetics

ggplot2 layers: aesthetics
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
ggplot2 layers: geometries

ggplot2 layers: geometries
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3)

ggplot2 layers: facets

ggplot2 layers: facets
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3) +
facet_grid( . ~ continent)

ggplot2 layers: statistics

ggplot2 layers: statistics
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3) +
facet_grid( . ~ continent) +
geom_smooth(color="black", lwd=1, se=TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: coordinates

ggplot2 layers: coordinates
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3) +
facet_grid( . ~ continent) +
geom_smooth(color="black", lwd=1, se=TRUE) +
scale_x_log10("GDP per Capita") +
scale_y_continuous("Life Expectancy in Years")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: theme

ggplot2 layers: theme
Examples of themes: theme_tufte() (from ggthemes), theme_classic(), theme_minimal()
library(ggthemes)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3) +
facet_grid( . ~ continent) +
geom_smooth(color="black", lwd=1, se=TRUE) +
scale_x_log10("GDP per Capita") +
ylab("Life Expectancy in Years") +
theme_tufte() + theme(legend.position="none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: theme
library(ggthemes)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha=0.5, size=3) +
facet_grid( . ~ continent) +
geom_smooth(color="black", lwd=1, se=TRUE) +
scale_x_log10("GDP per Capita",
labels = scales::trans_format("log10",
scales::math_format(10^.x))) +
ylab("Life Expectancy in Years") +
theme_economist() +
theme(legend.position="none") +
ggtitle("The relationship between wealth and longevity")
ggplot2 layers: theme
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The Plot-Making Process in ggplot
Understanding the layers of ggplot2
Recall: the three key grammatical elements:
| Data | The dataset being plotted. |
| Aesthetics | The scales onto which we map our data. |
| Geometries | The visual elements used for our data. |
Let’s take a closer look at these now.
Recall: Aesthetics vs. Attributes
- an attribute is simply a setting of things like color, shape, size etc. independent of what the data looks like
- in contrast, in the aesthetics layer, we map features of the data onto visible aesthetics
Recall: Setting Attributes
Here we set three attributes of the points: alpha, size, color
library(ggplot2)
library(gapminder)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10() +
geom_point(alpha=0.5, size=3, color="red")
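To make the contrast concrete, here is a small sketch that builds the same scatterplot twice, once with color set as an attribute and once with color mapped as an aesthetic, and then inspects the colors ggplot2 assigned. The object names are made up for the example.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10()

# Attribute: every point gets the same fixed color
set_plot <- base_plot + geom_point(color = "red", alpha = 0.5)

# Aesthetic: color depends on the data, one color per continent
map_plot <- base_plot + geom_point(aes(color = continent), alpha = 0.5)

# Inspect the colors assigned to the points in each plot
unique(layer_data(set_plot)$colour)
length(unique(layer_data(map_plot)$colour))
```

The set version should contain only "red", while the mapped version should contain one color per continent.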

Mapping onto shape
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
geom_point(size=3, alpha=0.3) + scale_x_log10()

Mapping onto size
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, size=pop)) +
scale_x_log10() +
geom_point(alpha=0.3)

Recall: Combining mappings
library(dplyr) # provides filter(), used in the examples that follow
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(subset(gapminder, continent %in% c("Americas","Europe")),
aes(x = gdpPercap, y = lifeExp, size=year,
color=continent, shape=continent)) +
scale_x_log10() + geom_point(alpha=0.3)

Typical aesthetics
| x | X axis position |
| y | Y axis position |
| colour | Colour of dots, outlines of other shapes |
| fill | Fill colour |
| size | Diameter of points, thickness of lines |
| alpha | Transparency |
| linetype | Line dash pattern |
| labels | Text on a plot or axes |
| shape | Shape |
Aesthetics and Geoms
- each geom_*() layer lets you set the aesthetics that make sense for that particular geom
- for example, geom_point() understands the following aesthetics: x, y, alpha, color, fill, group, shape, size, stroke. For geom_point() the aesthetics x and y are required.
- some aesthetics are limited to continuous variables, others to categorical variables
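As an illustration of the last point, the sketch below deliberately maps a continuous variable (pop) onto shape, which ggplot2 refuses to do. The error is captured with tryCatch() so the script keeps running; the object names are made up.

```r
library(ggplot2)
library(gapminder)

# shape is meant for categorical variables; pop is continuous (deliberate misuse)
bad_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, shape = pop)) +
  geom_point()

# Building the plot fails; capture the error instead of stopping
result <- tryCatch(ggplot_build(bad_plot), error = function(e) e)

inherits(result, "error")
conditionMessage(result)
```

Mapping pop to size or color instead works, as the next slides show.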
Aesthetics - Continuous Variables
| x | X axis position |
| y | Y axis position |
| colour | Colour of dots, outlines of other shapes |
| fill | Fill colour |
| size | Diameter of points, thickness of lines |
| alpha | Transparency |
linetype | Line dash pattern |
labels | Text on a plot or axes |
shape | Shape |
Aesthetics - Continuous Variables
ggplot(filter(gapminder,year==2007),
aes(x = gdpPercap, y = lifeExp, size=pop)) +
scale_x_log10() + geom_point(alpha=0.3) +
scale_size_continuous(name="pop", range = c(1,20))

Aesthetics - Continuous Variables
d <- filter(gapminder, year %in% c(1967,1977,1987,1997,2007))
ggplot(d, aes(x = gdpPercap, y = lifeExp, color=pop)) +
scale_x_log10() + geom_point(alpha=0.3, size=3)

Aesthetics - Continuous Variables
- size works clearly better than color in this case
- there are general guides about which types of aesthetics work better for which kind of variables – these are rooted in our understanding of visual perception as we have seen earlier.
Aesthetics - Categorical Variables - Mapping onto shape
ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
scale_x_log10() + geom_point(alpha=0.3, size=4)

Adding redundant channel to emphasize
ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
scale_x_log10() + geom_point(alpha=0.3, size=4) +
geom_point(data=filter(d, continent=="Americas"),
color="red", alpha=0.5, size=4) + theme(legend.position="none")

Encircle to emphasize
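# The ggalt package provides geom_encircle()
# devtools::install_github("hrbrmstr/ggalt", force=FALSE)
library(ggalt)
# previousplot is the plot from the previous slide, saved as an object
previousplot <- ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
scale_x_log10() + geom_point(alpha=0.3, size=4) +
geom_point(data=filter(d, continent=="Americas"),
color="red", alpha=0.5, size=4) + theme(legend.position="none")
previousplot + geom_encircle(data=filter(d, country=="United States"),
expand=0.05, color="blue", linetype=2, size=2)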

Connect to emphasize
library(ggthemes)
ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
scale_x_log10() +
geom_path(data=filter(d, country=="United States"),
color="light blue", linetype=1, size=6) +
geom_path(data=filter(d, country=="Venezuela"),
color="light green", linetype=1, size=6) +
geom_path(data=filter(d, country=="Haiti"),
color="orange", linetype=1, size=6) +
geom_point(alpha=0.3, size=4) +
geom_point(data=filter(d, continent=="Americas"),
color="red", alpha=0.5, size=4) +
theme_tufte() + theme(legend.position="none") + # complete theme first, so it does not reset legend.position
annotate("text", x = c(40000), y = c(73), size=6,
color="dark blue", label = c("United States")) +
annotate("text", x = c(13000), y = c(63), size=6,
color="dark green", label = c("Venezuela")) +
annotate("text", x = c(1200), y = c(62), size=6,
color="dark orange", label = c("Haiti"))
Connect to emphasize

Box plots and Dot Plots
- For some plots we have a specific geom. E.g. box plots are created with geom_boxplot().
- For other plots we can use the geoms we already know. E.g. for dot plots we can use geom_point().
- There are 37 geoms overall, but it is enough to know a few well. Use the ggplot2 cheat sheet for the rest.
Examples: Geoms and Type of Plot
| scatterplot | point | |
| bubblechart | point | size mapped to a variable |
| barchart | bar | |
| box-and-whisker plot | boxplot | |
| line chart | line | |
A New Dataset - Organ Donors
organs <- read.csv("organ_donors.csv")
dim(organs)
## [1] 238 21
head(organs)
# For convenience, let R know year is a time measure.
organs$year <- as.Date(strptime(organs$year, format="%Y"))
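A caveat worth knowing: when the format is only "%Y", strptime() fills in the missing month and day from the current date, so the resulting dates depend on the day the code is run. A possible alternative, sketched here on a made-up vector of years, pins every year to January 1.

```r
# Made-up example years; in the slides these would come from organs$year
years <- c(1991, 1992, 2002)

# Build an explicit ISO date string so month and day are fixed
year_dates <- as.Date(paste0(years, "-01-01"))

class(year_dates)
format(year_dates)  # "1991-01-01" "1992-01-01" "2002-01-01"
```

Either version works with scale_x_date(); the explicit one just makes the plotted positions reproducible.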
Let’s take a quick look
- Let’s explore the data a bit with some plots.
p <- ggplot(data=organs,
aes(x=year,
y=donors))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).

Some lineplots again
p + geom_line(aes(group=country,
color=consent.law)) +
scale_color_manual(values=c("gray40", "firebrick")) +
scale_x_date() +
labs(x="Year",
y="Donors",
color="Consent Law") +
theme(legend.position="top")
## Warning: Removed 34 rows containing missing values (geom_path).

Faceting
- We can also split the plot by some factor, called faceting
# ggplot has two faceting functions that do slightly different things: `facet_grid()`, seen here, and `facet_wrap()`. Try them out on the Gapminder data.
p + geom_line(aes(group=country)) +
labs(x="Year",
y="Donors") +
facet_grid(.~consent.law)
## Warning: Removed 34 rows containing missing values (geom_path).
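For the suggested exercise, a starting point on the Gapminder data might look like the sketch below. The choice of continent as the faceting variable and ncol = 2 are illustrative.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10()

# facet_grid(): panels in a strict grid, here one row with a column per continent
grid_plot <- base_plot + facet_grid(. ~ continent)

# facet_wrap(): one sequence of panels wrapped into rows, here two columns wide
wrap_plot <- base_plot + facet_wrap(~ continent, ncol = 2)
```

Print grid_plot and wrap_plot to compare the layouts. facet_grid() is most useful when rows and columns each stand for a variable, while facet_wrap() suits a single variable with many levels.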

A quick bit of data manipulation - Average by group
library(dplyr)
by.country <- organs %>% group_by(consent.law, country) %>%
summarize(donors=mean(donors, na.rm = TRUE))
by.country
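To see what the group-wise average does, here is the same pattern on a tiny made-up data frame. The column names mirror the organ donor data, but the values are invented.

```r
library(dplyr)

# Made-up data: two countries, one missing value
toy <- data.frame(
  consent.law = c("Informed", "Informed", "Presumed", "Presumed"),
  country = c("A", "A", "B", "B"),
  donors = c(10, NA, 20, 30)
)

# One row per (consent.law, country) pair; na.rm = TRUE skips the missing value
toy_means <- toy %>%
  group_by(consent.law, country) %>%
  summarize(donors = mean(donors, na.rm = TRUE), .groups = "drop")

toy_means$donors  # 10 25
```

Without na.rm = TRUE the mean for country A would be NA. The .groups = "drop" argument returns an ungrouped result and avoids the grouping message.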
Ordered dotplots
p <- ggplot(by.country, aes(x=donors, y=country, color=consent.law))
p + geom_point(size=3)

- Note, we are using geom_point() again.
- How can we improve this graph?
Ordering
We know that order helps visual perception.
p <- ggplot(by.country, aes(x=donors, y=reorder(country,donors),
color=consent.law))
p + geom_point(size=3)

- Get your factors (the categorical variable) in order when it makes sense.
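What reorder() does is easiest to see on a small made-up example: it returns a factor whose levels are sorted by a second variable.

```r
# Made-up data
toy <- data.frame(country = c("A", "B", "C"), donors = c(20, 10, 30))

# Levels sorted by increasing donors
ascending <- levels(reorder(toy$country, toy$donors))
ascending   # "B" "A" "C"

# Negate the values to sort in decreasing order
descending <- levels(reorder(toy$country, -toy$donors))
descending  # "C" "A" "B"
```

When a level has several rows, reorder() summarises them with mean() by default, which is why na.rm=TRUE is passed along in the box plot examples.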
Improve the labels
p + geom_point(size=3) +
labs(x="Donor Procurement Rate (per million population)",
y="", color="Consent Law") +
theme(legend.position="top")

Another way
p <- ggplot(by.country, aes(x=donors, y=reorder(country, donors)))
p + geom_point(size=3) +
facet_grid(consent.law ~ ., scales="free") +
labs(x="Donor Procurement Rate (per million population)",
y="",
color="Consent Law") +
theme(legend.position="top")

Boxplot
p <- ggplot(data=organs,aes(x=country,y=donors))
p + geom_boxplot() +
coord_flip() + # This is one way to get a horizontal box plot
labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
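Another way, assuming ggplot2 3.3.0 or later: put the categorical variable on the y axis and let geom_boxplot() infer the orientation. The sketch uses the Gapminder data so that it runs without the organ donor file.

```r
library(ggplot2)
library(gapminder)

# Categorical variable on y, numeric variable on x: the boxes are drawn horizontally
horizontal <- ggplot(gapminder, aes(x = lifeExp, y = continent)) +
  geom_boxplot() +
  labs(x = "Life Expectancy in Years", y = "")
```

Print horizontal to draw it. This avoids coord_flip(), so the axis labels refer to the axes as they appear on screen.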

Boxplot
p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE), y=donors))
p + geom_boxplot() + coord_flip() +
labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Boxplot
p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE),y=donors))
p + geom_boxplot(aes(fill=consent.law)) +
coord_flip() + labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Boxplot - Add some jitter
# Can combine jitter and boxplot if needed
ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE),y=donors)) +
geom_boxplot(aes(fill=consent.law), outlier.colour="transparent", alpha=0.3) +
coord_flip() + labs(x="", y="Donor Procurement Rate") +
geom_jitter(shape=21, aes(fill=consent.law), color="black",
position=position_jitter(w=0.1))
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
## Warning: Removed 34 rows containing missing values (geom_point).

1-D point summaries
p <- ggplot(data=organs, aes(x=reorder(country, donors, na.rm=TRUE), y=donors))
p + geom_point(aes(color=consent.law)) +
coord_flip() + labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing missing values (geom_point).

Add a little jitter
p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE), y=donors))
p + geom_jitter(aes(color=consent.law)) + coord_flip() +
labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing missing values (geom_point).

Fine-tune the jittering
p <- ggplot(data=organs, aes(x=reorder(country, assault, na.rm=TRUE), y=assault))
p + geom_jitter(aes(color=world),
position = position_jitter(width=0.15)) +
coord_flip() +
labs(x="", y="Assault") +
theme(legend.position="top")
## Warning: Removed 17 rows containing missing values (geom_point).

A few more useful geoms
p + geom_point() + ggtitle("point")
p + geom_text() + ggtitle("text")
p + geom_bar(stat = "identity") + ggtitle("bar")
p + geom_tile() + ggtitle("raster")
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")
Familiarize yourself with these options in ggplot2.
Thank you!